set.seed(0911)
library(ggplot2)
library(gridExtra)
library(cowplot)
library(plotly) # interactive plots
library(ggfortify) # diagnostic plot
library(forestmodel) # plot odds ratios
library(arm) # binnedplot diagnostic plot in GLM
library(knitr)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(broom) # function augment to add columns to the original data that was modeled
library(effects) # plot effect of covariate/factor
library(questionr) # odds ratios
library(lmtest) # LRtest
library(survey) # Wald test
library(vcdExtra) # deviance test
library(rsample) # for data splitting
library(glmnet)
library(nnet) # multinom, glm
library(caret)
library(ROCR)
#library(PRROC) # alternative package for ROC and PR curves
library(ISLR) # dataset for statistical learning
ggplot2::theme_set(ggplot2::theme_light()) # Set the graphical theme
For this chapter, we will use the binary.csv dataset, which contains information regarding the admissions of students to a new establishment. The dataset includes:
- A binary response (admit), which is a factor with two levels: 0 (not admitted) and 1 (admitted).
- Two continuous predictors: the Graduate Record Exam score (gre) and the Grade Point Average (gpa).
- A categorical predictor, rank, which has four levels (1, 2, 3 and 4). Note that a rank 1 establishment is more prestigious than a rank 2 establishment.
mydata<-read.csv("binary.csv",header=T,sep=",")
mydata$admit<-factor(mydata$admit)
mydata$rank<-factor(mydata$rank)
levels(mydata$admit)[levels(mydata$admit) %in% c('0', '1')] <-c('No', 'Yes')
table(mydata$admit) %>% as.data.frame() %>% setNames(c("Admit", "Counts")) %>% kableExtra::kbl()
| Admit | Counts |
|---|---|
| No | 273 |
| Yes | 127 |
Split the dataset (Train and Test)
set.seed(0911)
mydata_split <- initial_split(mydata, prop = 0.7, strata = "admit")
#mydata_split <- initial_split(mydata, prop = .7)
train <- training(mydata_split)
test <- testing(mydata_split)
Selected model.
Recall that the positive class is class 1 (Yes) in our example. The confusion matrix summarizes:
- tp. Correctly predicted positives (predicts 1=Yes and actual is 1=Yes).
- tn. Correctly predicted negatives (predicts 0=No and actual is 0=No).
- fp. Incorrectly predicted positives (predicts 1=Yes and actual is 0=No).
- fn. Incorrectly predicted negatives (predicts 0=No and actual is 1=Yes).
The counts in the matrix depend on the threshold \(s\) used to classify probabilities estimated by the model. Adjusting \(s\) changes the sensitivity and specificity of predictions, balancing the rate of TPs against FPs.
1. Confusion Matrix Structure
| | \(Y_i=0\quad\) | \(Y_i=1\quad\) |
|---|---|---|
| \(\widehat Y_i=0\quad\) | tn | fn |
| \(\widehat Y_i=1\quad\) | fp | tp |
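As a minimal illustration of this layout, the four cells can be read off a base-R `table()` of predicted and actual labels. The vectors below are invented purely for the example; the levels match those of mydata$admit.

```r
# Hypothetical labels, invented for illustration (levels match mydata$admit)
actual <- factor(c("No","No","Yes","Yes","No","Yes"), levels = c("No","Yes"))
pred   <- factor(c("No","Yes","Yes","No","No","Yes"), levels = c("No","Yes"))
cm <- table(Predicted = pred, Actual = actual)
tn <- cm["No","No"];  fn <- cm["No","Yes"]   # row Y_hat = 0: tn, fn
fp <- cm["Yes","No"]; tp <- cm["Yes","Yes"]  # row Y_hat = 1: fp, tp
c(tn = tn, fn = fn, fp = fp, tp = tp)
```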
2. Displaying Predictions for Different Thresholds
To examine how threshold \(s\)
affects predictions, we display the predicted probabilities for
modF.
Ⓡ The following plots show the distribution of predicted probabilities split by default threshold \(s=0.5\)
Neg<-glm_probs[glm_probs$probs<.5,]
Pos<-glm_probs[glm_probs$probs>.5,]
plot_ly() %>%
add_histogram( x=~Neg,name='Negative') %>%
add_histogram( x=~Pos,name='Positive') %>%
layout(xaxis = list(title = 'Predicted probabilities'),yaxis = list(title = 'Count'))%>%
layout(legend=list(title=list(text='<b> Prediction for s=0.5 </b>')))
Ⓡ Modified Threshold: \(s=0.3\). Lowering \(s\) captures more instances in the positive class, increasing sensitivity.
Neg3<-glm_probs[glm_probs$probs<.3,]
Pos3<-glm_probs[glm_probs$probs>.3,]
plot_ly() %>%
add_histogram( x=~Neg3,name='Negative') %>%
add_histogram( x=~Pos3,name='Positive') %>%
layout(xaxis = list(title = 'Predicted probabilities'),yaxis = list(title = 'Count'))%>%
layout(legend=list(title=list(text='<b> Prediction for s=0.3 </b>')))
3. Confusion Matrices for Different Thresholds
Ⓡ We compute confusion matrices for \(s=0.5\) and \(s=0.3\) to see the effect of the threshold.
# s=0.5
glm_pred<-glm_probs %>% mutate(pred.5 = as.factor(ifelse(probs>.5, m1, m0)))
# s=0.3
glm_pred$pred.3<-as.factor(ifelse(glm_probs$probs>.3,m1, m0))
glm_pred%>%rmarkdown::paged_table()
Ⓡ Display the confusion matrix for the default threshold \(s=0.5\):
CM_5<-caret::confusionMatrix(glm_pred$pred.5,train$admit,positive="Yes")$table
CM_5%>% kableExtra::kbl() %>% kableExtra::kable_styling()
| | No | Yes |
|---|---|---|
| No | 172 | 60 |
| Yes | 19 | 28 |
Ⓡ For the threshold \(s=0.3\), the confusion matrix changes
CM_3<-caret::confusionMatrix(glm_pred$pred.3,train$admit,positive="Yes")$table
CM_3%>% kableExtra::kbl() %>% kableExtra::kable_styling()
| | No | Yes |
|---|---|---|
| No | 124 | 30 |
| Yes | 67 | 58 |
📈 As the threshold decreases from \(s=0.5\) to \(s=0.3\):
fn and tn decrease, while fp and tp increase.
🔍 How can we determine the optimal threshold? Choosing the right threshold requires an appropriate evaluation metric.
Evaluation metrics are essential for assessing the quality of a classification model. Each metric offers a unique perspective on performance, helping you choose the right model and threshold for your needs. Here are three key metrics, each answering a different question about model quality:
- Accuracy (acc). How often is the model right overall?
- Precision (ppv). When the model predicts a positive, how often is it correct?
- Recall (tpr). How effectively does the model identify all actual positives?
🤔 Accuracy answers the question: how often is the model right?
Accuracy (\(\text{acc}\)) measures the proportion of correct predictions out of the total predictions, providing a general sense of model effectiveness: \[ \text{acc}=\frac{\text{tp}+\text{tn}}{\text{tp}+\text{tn}+\text{fp}+\text{fn}} \]
This is the proportion of correct classifications. In Ⓡ this can be calculated as follows:
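The hidden chunk plausibly resembles the following sketch (not the original code), using the \(s=0.5\) confusion-matrix counts shown above:

```r
# Sketch: accuracy from the s = 0.5 confusion-matrix counts above;
# the diagonal of the Prediction x Reference table holds tn and tp.
CM <- matrix(c(172, 19, 60, 28), nrow = 2,
             dimnames = list(Prediction = c("No","Yes"),
                             Reference  = c("No","Yes")))
acc <- sum(diag(CM)) / sum(CM)   # (tn + tp) / n = 200 / 279
round(acc, 7)
```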
## [1] 0.7168459
📈 For a threshold of \(s=0.5\), the accuracy is 0.7168459. However, adjusting the threshold \(s\) may improve accuracy further.
Ⓡ The following code calculates and visualizes accuracy across thresholds to identify an optimal threshold:
#library(ROCR)
Acc_pred <- prediction(glm_pred$probs,train$admit)
Acc_perf <- performance(Acc_pred, 'acc')
Acc_values<-data.frame(acc_v=slot(Acc_perf,"y.values")[[1]])
Acc_values$s_acc_v<-slot(Acc_perf,"x.values")[[1]]
Acc_values%>%rmarkdown::paged_table()
Ⓡ To visualize how accuracy varies with threshold \(s\):
#library(plotly)
Acc_max<-Acc_values$acc_v[which.max(Acc_values$acc_v)]
s_max<-Acc_values$s_acc_v[which.max(Acc_values$acc_v)]
plot_ly()%>%
add_segments(x = s_max, xend = s_max, y = 0, yend = Acc_max, line = list(dash = "dash", color = 'red',width = 0.5), showlegend = FALSE) %>%
add_segments(x = 0, xend = s_max, y = Acc_max, yend = Acc_max, line = list(dash = "dash", color = 'red',width = 0.5), showlegend = FALSE) %>%
add_trace(x = Acc_values$s_acc_v, y =Acc_values$acc_v, mode = 'lines', type = 'scatter')%>%
layout(yaxis = list(
title = "Accuracy"),
xaxis = list(title = "Threshold s"))%>%
layout(title = 'Accuracy at every threshold s')
📈 The plot shows how accuracy can improve by adjusting the threshold. For instance, the optimal threshold is \(s=0.5386559\) for a maximum accuracy of 0.734767. This adjustment raises accuracy from 0.7168459 to 0.734767, though it may affect other metrics.
Ⓡ Updating the confusion matrix at this optimal threshold reveals changes:
glm_pred$pred.3<-NULL
glm_pred$pred.s_max_Acc<-as.factor(ifelse(glm_probs$probs>s_max,m1,m0))
caret::confusionMatrix(glm_pred$pred.s_max_Acc,train$admit,positive="Yes")$table
## Reference
## Prediction No Yes
## No 182 66
## Yes 9 22
📈 At this new, higher threshold, tn and fn increase while fp and tp decrease.
Pros.
Cons.
📝 Remark.
In real-world applications with imbalanced classes, such as detecting missiles, global accuracy can be misleading. Missing an actual missile (false negative) is more critical than falsely predicting one, so accuracy alone may be insufficient. Instead, Precision and Recall provide a more reliable basis for choosing a threshold in these scenarios.
🤔 Precision answers the question: When the model predicts a positive, how often is it correct?
Precision, also known as Positive Predictive Value
(ppv), measures the accuracy of positive
predictions made by a classification model. It quantifies how often the
model correctly identifies instances of the positive class.
Specifically, precision is defined as the ratio of true positives to the
sum of true positives and false positives: \[
\text{prec}=\text{ppv}=\frac{\text{tp}}{\text{tp}+\text{fp}}
\]
Example. To illustrate this concept, consider the scenario of detecting spam in an email inbox, where “spam” is designated as the positive class.
In this case, the cost of a false positive is generally higher, making it crucial to minimize false positives. Therefore, maximizing precision becomes essential to ensure the model is reliable when predicting spam.
Pros.
Cons.
📝 Remark.
Note that we can also define Negative Predictive Value (\(\text{npv}\)), which assesses the model’s performance concerning the negative class. It is compared to the estimated total negatives (\(\text{tn}+ \text{fn}\)): \[ \text{npv}=\frac{\text{tn}}{\text{tn}+\text{fn}} \]
🤔 Recall answers the question: How effectively does the model identify all actual positives?
Recall, also known as Sensitivity
or True positive rate (tpr), quantifies
the ability of a classification model to correctly identify positive
instances from the total actual positives in the dataset. It is defined
as the ratio of true positives to the total number of actual positives:
\[
\text{rec}=\text{sens}=\text{tpr}=\frac{\text{tp}}{\text{tp}+\text{fn}}
\]
Example. To illustrate this concept, consider the case of identifying sick patients in a medical setting, where “sick” represents the positive class.
In this scenario, the cost of a false negative is particularly significant, as failing to identify a sick patient could have serious health consequences. Therefore, minimizing false negatives is essential, which directly translates to maximizing recall.
Pros.
Cons.
Prioritizes false negatives over false positives. Recall tends to treat false negatives as more costly than false positives, which might not align with all situations. While it aims to capture all positives, it may lead to an unacceptably high number of false positives if not managed carefully.
Potential for misleading performance. If the goal is to achieve “total recall,” a naive strategy would be to classify every instance as positive, resulting in 100% recall but an overwhelming number of false positives. This can render the model ineffective and diminish its practical utility.
📝 Remarks.
- The term “Sensitivity” is more commonly used in medical and biological research than in machine learning.
- Additionally, we can define the False Negative Rate (FNR), which measures the proportion of actual positives \((\text{tp}+\text{fn})\) that were not correctly identified: \[ \text{fnr}=\frac{\text{fn}}{\text{tp}+\text{fn}}=1-\text{rec}=1-\text{sens} \]
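A quick check of these definitions can be sketched against the \(s=0.5\) confusion matrix from earlier (tp = 28, fn = 60, fp = 19); the results match the caret output shown later in this section:

```r
# Sketch: precision, recall and fnr from the s = 0.5 counts above
tp <- 28; fn <- 60; fp <- 19
prec <- tp / (tp + fp)   # ppv
rec  <- tp / (tp + fn)   # sensitivity / tpr
fnr  <- fn / (tp + fn)   # = 1 - rec
round(c(precision = prec, recall = rec, fnr = fnr), 4)
```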
#pred <- prediction(glm_pred$probs,glm_pred$y)
#perf <- performance(pred, "fp","tp","fn","tn","npv","tnr")
In this section, we compute and visualize Precision and Recall at various threshold values to understand how adjusting the threshold impacts model performance.
Ⓡ Key code \(\,\)
The following code segment:
- computes the predictions with the prediction() function from the ROCR package;
- computes the precision (prec) and recall (rec) metrics using the performance() function from the same package;
- stores the results in pr_values. Use pr_values <- pr_values[-c(1),] if you need to remove the first row of pr_values, which can be helpful for specific visualization purposes.
glm_pred$pred.s_max_Acc<-NULL
glm_pred$y<-as.factor(train$admit)
#library(ROCR)
pr_pred <- prediction(glm_pred$probs,glm_pred$y)
pr_perf <- performance(pr_pred, "rec", "prec")
pr_values<-data.frame(Threshold=slot(pr_perf,"alpha.values")[[1]])
pr_values$Recall<-slot(pr_perf,"y.values")[[1]]
pr_values$Precision<-slot(pr_perf,"x.values")[[1]]
# pr_values <- pr_values[-c(1),]
pr_values %>% rmarkdown::paged_table()
Ⓡ Next, we will visualize the relationship between Precision and Recall at different threshold levels.
plot_ly(data = pr_values, x = ~Threshold) %>%
add_trace(y = ~Precision, mode = 'lines', name = 'Precision', type = 'scatter')%>%
add_trace(y = ~Recall, mode = 'lines', name = 'Recall', type = 'scatter')%>%
layout(title = 'Precision and Recall at every threshold s') %>%
layout(legend=list(title=list(text='<b> Metrics </b>')))
Take away. Precision is a suitable metric when you are more concerned about “being right” when assigning the positive class than about “detecting them all.” Conversely, Recall describes the model’s ability to predict the positive class when the actual result is positive.
In Summary.
In addition to Precision and Recall, several other important metrics provide a more comprehensive evaluation of classification models. Here, we introduce two key metrics: Specificity and the False Positive Rate.
Specificity (True Negative Rate, TNR) measures the proportion of actual negatives that are correctly identified by the model. It assesses the model’s ability to avoid false positives. Specifically, it is defined as the ratio of true negatives to the total number of actual negatives \((\text{tn}+\text{fp})\): \[ \text{spec}=\text{tnr}=\frac{\text{tn}}{\text{tn}+\text{fp}} \] High specificity indicates that the model is effective in identifying negative instances, which is particularly important in scenarios where the cost of false positives is high.
False Positive Rate (FPR) quantifies the proportion of actual negatives \((\text{tn}+\text{fp})\) that are incorrectly classified as positives. It provides insight into how often the model mistakenly labels a negative instance as positive: \[ \text{fpr}=\frac{\text{fp}}{\text{tn}+\text{fp}}=1-\text{tnr}=1-\text{spec} \] A low FPR indicates that the model is reliable in minimizing the misclassification of negative instances.
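With the same \(s=0.5\) counts used above (tn = 172, fp = 19), these two quantities can be checked directly; the specificity matches the caret output shown below:

```r
# Sketch: specificity (tnr) and fpr from the s = 0.5 counts above
tn <- 172; fp <- 19
spec <- tn / (tn + fp)   # tnr
fpr  <- fp / (tn + fp)   # = 1 - spec
round(c(specificity = spec, fpr = fpr), 4)
```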
Displaying the Metrics with Ⓡ. To gain a clearer understanding of these metrics, we can display all relevant metrics using the confusion matrix. This matrix provides a comprehensive overview of the model’s performance across different classifications:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 172 60
## Yes 19 28
##
## Accuracy : 0.7168
## 95% CI : (0.6601, 0.7689)
## No Information Rate : 0.6846
## P-Value [Acc > NIR] : 0.1363
##
## Kappa : 0.2501
##
## Mcnemar's Test P-Value : 6.784e-06
##
## Sensitivity : 0.3182
## Specificity : 0.9005
## Pos Pred Value : 0.5957
## Neg Pred Value : 0.7414
## Precision : 0.5957
## Recall : 0.3182
## F1 : 0.4148
## Prevalence : 0.3154
## Detection Rate : 0.1004
## Detection Prevalence : 0.1685
## Balanced Accuracy : 0.6094
##
## 'Positive' Class : Yes
##
Key metrics interpretation.
Accuracy. The model achieves an accuracy of approximately 71.68%, indicating that it correctly predicts the outcome for nearly three-quarters of all instances. While this is reasonable, further improvements may be needed, particularly in identifying positive instances. The 95% confidence interval (0.6601, 0.7689) means we are 95% confident that the true accuracy lies within this range; the model performs well, but there may be variability in its performance.
No Information Rate (NIR). (0.68) This is the accuracy one would expect by always predicting the majority class (No). The model’s accuracy is slightly above this rate.
Kappa measures agreement between predictions and actual classifications, adjusted for random chance. A value of 0.25 suggests only fair agreement.
Mcnemar’s Test P-Value. This low p-value (\(6.78\times10^{-6}\)) indicates a statistically significant difference between the false positive and false negative rates.
📌 Note.
Under \(\mathcal{H}_0\): There is no significant difference between the false positive and false negative rates, meaning that classification errors occur symmetrically.
Performance metrics.
Sensitivity (Recall). The model correctly identifies around 31.82% of actual positive cases (Yes), indicating that it misses a notable number of positive instances.
Specificity. The model accurately identifies approximately 90.05% of actual negatives (No), demonstrating strong ability in avoiding false positives.
Positive Predictive Value (PPV or Precision). When the model predicts a positive (Yes), it is correct 59.57% of the time, reflecting a moderate level of precision.
Negative Predictive Value (NPV). The model correctly predicts the negative class around 74.14% of the time.
F1 Score. The F1 score balances precision and recall. With a value of 0.41, it highlights the need to improve balance between sensitivity and precision.
Prevalence. This metric indicates the proportion of actual positive instances (Yes) in the dataset, approximately 31.54%.
Detection Rate. The model successfully detects around 10.04% of actual positives, indicating a relatively low sensitivity.
Detection Prevalence. This metric reflects the proportion of instances predicted as positive among all predictions, approximately 16.85%.
Balanced Accuracy. Balanced accuracy averages sensitivity and specificity, providing a more balanced view of performance. With 0.61, it indicates modest overall effectiveness.
The confusion matrix and associated metrics indicate that, while the model has high specificity, it lacks sensitivity. To improve its ability to detect positive cases without significantly increasing false positives, adjusting the classification threshold or applying other modeling techniques may be beneficial.
There are various methods to evaluate the performance of a prediction model, which allows for meaningful comparisons between different models.
Graphically, ROC curves and Precision-Recall curves are commonly employed:
📝 Remarks.
- ROC curves may present an overly optimistic view of the model’s performance on datasets with class imbalance.
- To compare models directly or evaluate their performance at different thresholds, we analyze the shapes of their curves.
The ROC curve (Receiver Operating Characteristic curve) is a useful tool for predicting the probability of a binary outcome. It is a graphical representation of \(\text{fpr}(s)\), the “false alarm rate” (plotted on the \(x\)-axis), against \(\text{tpr}(s)\), the “success rate”, also known as Recall or Sensitivity (plotted on the \(y\)-axis), across a range of threshold values \(s \in[0,1]\).
The shape of the curve conveys significant information:
- Smaller values on the \(x\)-axis indicate fewer fp and a higher tn count.
- Larger values on the \(y\)-axis indicate more tp and fewer fn.
[The North-West Rule]
Intuitively, a point \((\text{fpr}(s),\text{tpr}(s))\) of the ROC
curve representing one model is considered better than that of another
model \((\text{fpr}(s'),\text{tpr}(s'))\)
if it is northwest of it, meaning it has a higher true positive rate and
a lower false positive rate.
📝 Remarks.
- A model with perfect skill is represented by \((\text{fpr}(s),\text{tpr}(s))=(0,1)\).
- The model such that \((\text{fpr}(s),\text{tpr}(s))=(0,0)\) systematically predicts the negative class.
- The procedure such that \((\text{fpr}(s),\text{tpr}(s))=(1,1)\) systematically predicts the positive class.
- A no-skill model is one that cannot distinguish between the classes and would predict a random class or a constant class in all cases; at \(s=0.5\) it is represented by \((\text{fpr}(s),\text{tpr}(s))=(0.5, 0.5)\).
- A no-skill model is depicted by a diagonal line from the bottom left to the top right of the plot.
Ⓡ Now, compute and visualize the ROC curve for our prediction model. First, we generate the ROC predictions and calculate the performance metrics.
#library(ROCR)
roc_pred <- prediction(glm_pred$probs,glm_pred$y)
roc_perf <- performance(roc_pred, "fpr", "tpr")
roc_values<-data.frame(Threshold=slot(roc_perf,"alpha.values")[[1]])
roc_values$TPR<-slot(roc_perf,"x.values")[[1]]
roc_values$FPR<-slot(roc_perf,"y.values")[[1]]
#pr_values<-pr_values[-c(1),]
roc_values%>%rmarkdown::paged_table()
Ⓡ Then, create a plot to visualize the ROC curve, enabling us to assess how well the model discriminates between the positive and negative classes. This will assist in selecting an optimal threshold that balances the rates of false positives and false negatives effectively.
plot_ly(data = roc_values, x = ~FPR) %>%
add_trace(y = ~TPR, mode = 'lines', name = 'ROC curve', type = 'scatter')%>%
add_segments(x = 0, xend = 1, y = 0, yend = 1,name='No skill model', line = list(dash = "dash", color = 'red',width=1), showlegend = T)%>%
layout(title = 'ROC curve')%>%
layout(legend=list(title=list(text='<b> Legend </b>')))
📝 Remark.
An operator may plot the ROC curve for the final model and choose a threshold that achieves a desirable balance between false positives and false negatives.
😕 Unbalanced classes.
The Precision-Recall curve is a valuable tool for evaluating binary classification models, particularly in the context of unbalanced datasets[1]. In many real-world scenarios, we often encounter datasets where one class (the majority class) significantly outnumbers the other (the minority class). For instance, in our analysis, class No (no event) comprises a vast majority of observations, while class Yes (event) has only a few instances. This curve is particularly useful when assessing model performance in scenarios where false positives and false negatives have different costs.
The Precision-Recall curve provides a graphical representation that plots Recall (\(\text{rec}\), sensitivity) on the \(x\)-axis against Precision (\(\text{prec}\)) on the \(y\)-axis for various threshold values.
[The North-East Rule]
Intuitively, a point \((\text{rec}(s),\text{prec}(s))\) of the Precision-Recall curve corresponding to a given model is considered better than another model \((\text{rec}(s'),\text{prec}(s'))\) if it is located further to the northeast on the graph. In other words, it will exhibit both higher Recall and higher Precision.
📝 Remarks.
- A model with perfect skill is depicted as a point at \((\text{rec}(s),\text{prec}(s))=(1,1)\).
- A skillful model is represented by a curve that slopes toward (1,1) and lies above the flat line indicative of no skill.
- A no-skill model predicts a random or constant class regardless of the input instances. It is represented by a horizontal line at the value of the ratio of positive cases in the dataset: \[ Q=\frac{\sum_{i=1}^nY_i}{n}=:\frac{N_+}{n} \]
Ⓡ To better understand how a no-skill model behaves, we can calculate \(Q\), which reflects the distribution of positive instances in our dataset.
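The hidden computation presumably has the following shape; the counts 88 and 279 are inferred from the training split and confusion-matrix totals shown above, and are assumptions of this sketch:

```r
# Sketch (assumed shape of the hidden chunk): positive rate Q = N+ / n
# on the training set; 88 "Yes" out of 279 matches the printed value.
N_pos <- 88; n <- 279
Q <- N_pos / n
round(Q, 7)
```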
## [1] 0.3154122
Ⓡ Display the Precision-Recall curve
plot_ly(data = pr_values, x = ~Recall) %>%
add_trace(y = ~Precision, mode = 'lines', name = 'Precision-recall curve', type = 'scatter')%>%
add_segments(x = 0, xend = 1, y = Q, yend = Q,name='No skill model',
line = list(dash = "dash", color = 'red',width=1), showlegend = T)%>%
layout(title = 'Precision-recall curve')%>%
layout(legend=list(title=list(text='<b> Legend </b>')))
📈 The plot of the Precision-Recall curve illustrates the model’s performance, showing that it remains above the no-skill line across all thresholds, indicating that the model is capable of distinguishing the positive class effectively.
The Area Under the Curve (AUC) serves as a robust metric for summarizing the performance of a classification model across all possible thresholds. By comparing the AUC values of different models, we can assess their relative strengths and weaknesses.
AUC quantifies the model’s overall ability to discriminate between positive and negative classes. AUC values range from 0 to 1, where 0 indicates that all predictions are incorrect, and 1 indicates that all predictions are correct.
Pros.
Cons.
Ⓡ Display the ROC-AUC and PR-AUC using the following code:
#library(ROCR)
roc_auc_pred <- prediction(glm_pred$probs,glm_pred$y)
roc_auc_perf <- performance(roc_auc_pred, "auc")
roc_auc<- unlist(slot(roc_auc_perf,"y.values"))
pr_auc_pred <- prediction(glm_pred$probs,glm_pred$y)
pr_auc_perf <- performance(pr_auc_pred, "aucpr")
pr_auc<- unlist(slot(pr_auc_perf,"y.values"))
cat("roc_auc :", roc_auc,", pr_auc :",pr_auc)
## roc_auc : 0.7146597 , pr_auc : 0.5271533
Interpretation.
For the ROC curve, we compare the ROC-AUC with 0.5, the ROC-AUC of the no-skill model. Here, the model ranks a randomly chosen positive above a randomly chosen negative about 71% of the time. This indicates reasonably good discriminative ability, and the model could be further optimized for improved class separation.
For the Precision-Recall curve, we compare the PR-AUC with \(Q=0.3154122\), the baseline PR-AUC for a no-skill model.
In this case, the model shows limitations in predicting positive classes, particularly in the context of class imbalance. This emphasizes that while the model performs well overall (as indicated by the ROC-AUC), it struggles significantly with accurately identifying the minority class (events denoted as Yes).
📝 Remarks: Importance of Threshold Adjustment.
- When comparing models, a higher AUC indicates better performance!
- Balance between Precision and Recall. Adjusting the threshold allows for finding an optimal balance between precision (the percentage of correct positive predictions) and recall (the percentage of true positives identified).
- Imbalanced Classes. In datasets with class imbalance, a fixed threshold (like 0.5) may lead to suboptimal performance for the minority class. For instance, a model might predict all instances as negative to maximize precision, which is not useful for detecting the positive class.
In addition to AUC metrics, another widely recognized composite score for summarizing Precision and Recall is the F1-score. This score calculates the harmonic mean of Precision and Recall (hence the term “harmonic mean” since both are rates): \[ \textbf{F1-score}=\frac{2}{\frac{1}{\text{rec}}+\frac{1}{\text{prec}}}=\frac{\text{tp}}{\text{tp}+\frac12(\text{fn}+\text{fp})} \] An F1-score of 50%, equivalent to \(\text{tp}= ( \text{fn}+\text{fp})/2\), indicates that for every correct positive prediction, the model makes two errors (either false negatives or false positives).
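The two equivalent forms of the formula can be verified on the \(s=0.5\) counts from earlier (tp = 28, fn = 60, fp = 19):

```r
# Sketch: the two equivalent F1 expressions agree on the s = 0.5 counts
tp <- 28; fn <- 60; fp <- 19
prec <- tp / (tp + fp); rec <- tp / (tp + fn)
f1_harmonic <- 2 / (1 / rec + 1 / prec)            # harmonic-mean form
f1_counts   <- tp / (tp + 0.5 * (fn + fp))         # count form
round(c(harmonic = f1_harmonic, counts = f1_counts), 7)
```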
📝 Remarks.
- F1-score is only high if \(\text{prec}\) and \(\text{rec}\) are high. It therefore reflects a compromise between them.
- The F1-score summarizes model skill for a specific probability threshold (here \(s=0.5\)).
Ⓡ What is the F1-score of our model?
F1_score=caret::confusionMatrix(glm_pred$pred.5,glm_pred$y,mode = "everything",positive="Yes")$byClass['F1']
F1_score
## F1
## 0.4148148
📈 Interpretation
📌 Context Matters. An F1-score of 0.41 might be acceptable in some contexts, but in critical applications (like medical diagnosis or fraud detection), it is often deemed insufficient. Additional efforts to optimize the model, such as adjusting classification thresholds, refining the feature set, or employing advanced techniques like ensemble methods, may be necessary to improve performance.
To better understand the impact of the classification threshold on model performance, we will explore the F1-score across all possible thresholds, rather than hiding this choice behind the predict method.
[F1-score curve]
Intuitively, the F1-score curve of a perfect model would satisfy the following characteristics depending on the classification threshold:
Threshold too low. At this level, positives are correctly predicted, but negatives are incorrectly classified. As the threshold increases, the F1-score typically improves.
Optimal threshold. Here, there is perfect separation between the two classes, meaning neither positives nor negatives are misclassified. Consequently, the F1-score reaches its maximum value of 1 and remains constant.
Threshold too high. In this scenario, negatives are correctly predicted, but positives are incorrectly classified. As the threshold continues to increase, the F1-score decreases.
Ⓡ We can visualize the F1-score across different thresholds using the following code:
#library(ROCR)
F1_pred <- prediction(glm_pred$probs,glm_pred$y)
F1_perf <- performance(F1_pred, "f")
pr_values_b<-pr_values
pr_values_b$F1<-slot(F1_perf,"y.values")[[1]]
pr_values_b%>%rmarkdown::paged_table()Ⓡ The following plot will illustrate the relationship between the threshold, precision, recall, and F1-score:
#library(plotly)
plot_ly(data=pr_values_b, x = ~Threshold)%>%
add_trace(y = ~Precision, mode = 'lines', name = 'Precision', type = 'scatter',line = list(width = 1, dash ='dot'))%>%
add_trace(y = ~Recall, mode = 'lines', name = 'Recall', type = 'scatter',line = list(width = 1, dash = 'dot'))%>%
add_trace( y =~F1, mode = 'lines', name = 'F1-score', type = 'scatter',line = list(width = 2))%>%
layout( xaxis = list(title = "Threshold")) %>%
layout(title = 'F1-score(s)')%>%
layout(legend=list(title=list(text='<b> Metrics </b>')))
[What is a no-skill model?]
A no-skill model is one where positive individuals share the same probability distribution as negative individuals. For a dataset containing \(N_+=\sum_{i=1}^n y_i\) positives, we define the rate of positives as \[ Q=\frac{N_+}{n} \] Since both positive and negative individuals have the same probability distribution (assumed uniform on \([0,1]\)), the predicted positive rate at a threshold \(s\) is given by \[\widehat{Q}(s)=\frac{(1-s)n}{n}=1-s\] The confusion matrix elements can be expressed as follows:
| | \(Y_i=0\quad\) | \(Y_i=1\quad\) |
|---|---|---|
| \(\widehat Y_i=0\quad\) | \(\text{tn}=(1-Q)(1-\widehat{Q}(s))\) | \(\text{fn}=Q(1-\widehat{Q}(s))\) |
| \(\widehat Y_i=1\quad\) | \(\text{fp}=(1-Q)\widehat{Q}(s)\) | \(\text{tp}=Q\widehat{Q}(s)\) |
The precision and recall can then be expressed as: \[\begin{array}{l} \text{prec}=\frac{\text{tp}}{\text{tp}+\text{fp}}=\frac{Q\widehat{Q}(s)}{\widehat{Q}(s)}=Q\\ \text{rec}=\frac{\text{tp}}{\text{tp}+\text{fn}}=\frac{Q\widehat{Q}(s)}{Q}=\widehat{Q}(s) \end{array} \] Thus, the F1-score of the no-skill model can be calculated as: \[ \textbf{F1-score}(s)=\frac{2}{\frac{1}{\text{rec}}+\frac{1}{\text{prec}}}=\frac{2}{\frac{1}{Q}+\frac{1}{\widehat{Q}(s)}}=\frac{2Q\widehat{Q}(s)}{Q+\widehat{Q}(s)} \]
Ⓡ The following code calculates the success rate for a no-skill model and adds it to the data for comparison with the F1-score:
Q<-sum((table(glm_pred$y)[2]))/length(glm_pred$y)
hat_Q<-1-pr_values_b$Threshold
pr_values_b$no_skill<-(2*Q*(1-pr_values_b$Threshold))/(Q+(1-pr_values_b$Threshold))
pr_values_b%>%rmarkdown::paged_table()
Ⓡ To visualize the F1-score against the no-skill model, we can use:
plot_ly(data=pr_values_b, x = ~Threshold)%>%
add_trace( y =~F1, mode = 'lines', name = 'F1-score', type = 'scatter',line = list(width = 2))%>%
add_trace( y =~no_skill, mode = 'lines', name = 'No-skill', type = 'scatter',line = list(width = 1, dash = 'dot'))%>%
layout( xaxis = list(title = "Threshold")) %>%
layout(title = 'F1-score(s) vs No-skill')%>%
layout(legend=list(title=list(text='<b> Curves </b>')))
📈 Interpretation
The observed intersection of the F1-score and no-skill curves at \(s\approx 0.51\) provides insight into the model’s performance across different thresholds. Initially, as the threshold varies from low to moderate values, the F1-score curve is above the no-skill line. This indicates that the model has skill in distinguishing positive from negative cases better than a random classifier.
However, once the threshold increases past around \(s\approx 0.51\), the F1-score drops below the no-skill line. This shift implies that at higher thresholds, the model struggles to maintain a balance between precision and recall and begins to perform worse than the no-skill benchmark. This threshold of \(s\approx 0.51\) is a key point, suggesting that for this model, using higher thresholds may significantly decrease its effectiveness in identifying true positives without introducing many errors (false positives and false negatives).
Analyzing extreme thresholds allows us to examine the model’s performance in edge cases, specifically when the classification threshold \(s\) is set to 0 or 1.
When \(s=1\) , the model classifies all observations as negative, resulting in no observations being predicted as positive, so \(\widehat{Q}(1)=0\). In this case, the F1-score is defined to be 0, as the model detects no positive cases.
When \(s=0\) the model classifies all observations as positive, meaning the predicted positive rate, denoted as \(\widehat{Q}(0)\), is equal to 1: \[ \textbf{F1-score}(0)=\frac{2Q\widehat{Q}(0)}{Q+\widehat{Q}(0)}=\frac{2Q}{Q+1} \]
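The hidden chunk presumably evaluates this expression; a sketch with \(Q\) taken as the training-set positive rate computed earlier:

```r
# Sketch: F1-score at s = 0 for the no-skill model, 2Q / (Q + 1)
Q <- 88 / 279              # positive rate from the training set
round(2 * Q / (Q + 1), 6)
```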
## [1] 0.479564
📈 This value suggests that when the model predicts every observation as positive, it achieves a moderate balance between precision and recall, primarily driven by the proportion of positive cases in the dataset (\(Q\)). There’s room for improvement in the model’s performance.
Which threshold maximizes the F1-score?
To determine the threshold that maximizes the F1-score, we can extract the maximum value of the F1-score from our calculated values and identify the corresponding threshold.
Ⓡ The code below performs this calculation:
F1_max<-pr_values_b$F1[which.max(pr_values_b$F1)]
s_F1_max<-pr_values_b$Threshold[which.max(pr_values_b$F1)]
c(s_F1_max,F1_max)
## [1] 0.2849827 0.5789474
This indicates that at a threshold of approximately 0.28, the model achieves its highest F1-score of about 0.58.
Despite an overall PR-AUC of 0.53, indicating some level of predictive ability, there remains significant potential for improving model performance. One approach is to employ the F-beta score, which extends the F1-score by introducing a weighting mechanism for precision and recall. This allows us to tailor the model evaluation to specific contexts, depending on whether precision or recall is more critical for the task at hand.
The F-beta score is defined as follows: \[ \textbf{Fbeta-score}(s)=\frac{1+\beta^2}{\frac{\beta^2}{\text{rec}}+\frac{1}{\text{prec}}} \]
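To see how \(\beta\) shifts the emphasis, the score can be evaluated on the \(s=0.5\) counts from earlier (tp = 28, fn = 60, fp = 19); \(\beta=1\) recovers the F1-score, while \(\beta<1\) favours precision and \(\beta>1\) favours recall:

```r
# Sketch: F-beta for several beta values on the s = 0.5 counts above
fbeta <- function(prec, rec, beta) (1 + beta^2) / (beta^2 / rec + 1 / prec)
prec <- 28 / (28 + 19); rec <- 28 / (28 + 60)
round(c(F0.4 = fbeta(prec, rec, 0.4),
        F1   = fbeta(prec, rec, 1),
        F2   = fbeta(prec, rec, 2)), 4)
```

Because precision (0.596) exceeds recall (0.318) here, the precision-weighted \(\beta=0.4\) score is the highest of the three.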
Ⓡ Calculation of F-beta Scores. To explore how different values of \(\beta\) impact the F-beta scores, we compute F-beta scores for specific weights. Below, we calculate F-beta scores for \(\beta=0.4\) and \(\beta=0.6\):
beta<-0.4
pr_values_b$Fbeta.4<-(1+beta^2)/((beta^2/(pr_values_b$Recall))+(1/(pr_values_b$Precision)))
beta<-0.6
pr_values_b$Fbeta.6<-(1+beta^2)/((beta^2/(pr_values_b$Recall))+(1/(pr_values_b$Precision)))
Ⓡ Visualization of F-beta Scores. The graph below illustrates the F1-score and F-beta scores for various threshold values. This allows us to visualize how the choice of \(\beta\) influences the model’s performance and highlights the areas where improvements can be made.
plot_ly(data=pr_values_b, x = ~Threshold)%>%
add_trace(y = ~F1, mode = 'lines', name = 'beta=1 (F1)', type = 'scatter',line = list(width = 1, dash = 'dot'))%>%
add_trace( y =~Fbeta.6, mode = 'lines', name = 'beta=0.6', type = 'scatter',line = list(width = 2))%>%
add_trace( y =~Fbeta.4, mode = 'lines', name = 'beta=0.4', type = 'scatter',line = list(width = 2))%>%
layout(title = 'F-beta-score(s) for different values of beta ')%>% layout(legend=list(title=list(text='<b> beta </b>')))%>%
layout(yaxis = list(title = 'Fbeta-score(s)'),xaxis = list(title = 'Threshold(s)'))
Insights from the F-beta Scores
The F-beta scores provide a nuanced perspective on model performance beyond the PR AUC. By adjusting the value of \(\beta\), we can focus on improving either precision or recall according to the specific needs of the application. This targeted approach can lead to more effective model enhancements, especially in domains where the cost of false negatives or false positives carries different weights.
In conclusion, utilizing F-beta scores allows us to strategically navigate the trade-offs between precision and recall, ultimately striving for a model that not only performs well overall but also aligns with the specific requirements of the task at hand.
[1] For more details, see “The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets”, Takaya Saito & Marc Rehmsmeier (2015) (link to the paper).